Abstract: A simple proof of the Shannon coding theorem, using only the Markov inequality, is presented. The technique is useful for didactic purposes, since it does not require many preliminaries, and the information density and mutual information follow naturally in the proof. It may also be applicable to situations where typicality is not natural.



I. Introduction 

Shannon's channel coding theorem (achievability) for memoryless channels was originally proven based on typicality [1], which is formalized in today's textbooks [2] by the asymptotic equipartition property (AEP). The way information theory is introduced in most textbooks and graduate courses requires one to first get acquainted with the concepts of entropy, mutual information, typicality, etc., before being able to understand this proof. Gallager [3] proposed a different proof, which also leads to the achievable error exponent. An alternative proof for DMCs was given by Csiszár and Körner using the method of types [4, §IV]. Less known is Shannon's proof [5] based on what is nowadays termed information density, which also uses elementary tools.

In recent works the Markov inequality was used to analyze the performance of rateless codes [6] and universal decoding schemes [7]. The underlying technique is remarkably simple, especially when applied to the memoryless channel. In addition to the Markov inequality, the proof only uses basic probability laws and the law of large numbers. This technique may already be known to some, but it was never published, so it seems worthwhile to do so.

II. A Proof of the Coding Theorem

Random variables are denoted by capital letters and vectors by boldface. When applying a single-letter distribution to a vector it is implicitly extended i.i.d., i.e. $P_X(\mathbf{X}) = \prod_{i=1}^{n} P_X(X_i)$.

The Markov inequality simply states that for a non-negative random variable $A$ and any $t > 0$,

$$\Pr\{A \ge t\} \le \frac{E[A]}{t} \tag{1}$$

and is easily proven by taking the expected value over the relation $\mathrm{Ind}(A \ge t) \le \frac{A}{t}$ (where $\mathrm{Ind}(\cdot)$ denotes an indicator function).
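As a small numerical illustration of (1), the following sketch evaluates both sides exactly for a toy discrete distribution. The function name `markov_check`, the dictionary input format and the distribution itself are assumptions made for this example only.

```python
from fractions import Fraction

def markov_check(pmf, t):
    """Return (Pr{A >= t}, E[A]/t) for a non-negative discrete r.v. A.

    pmf maps each value of A to its probability (hypothetical input format).
    """
    if t <= 0 or any(a < 0 for a in pmf):
        raise ValueError("need t > 0 and a non-negative random variable")
    tail = sum(prob for a, prob in pmf.items() if a >= t)
    mean = sum(a * prob for a, prob in pmf.items())
    return tail, mean / t

# Toy distribution, chosen only for illustration.
pmf = {0: Fraction(1, 2), 1: Fraction(1, 4), 4: Fraction(1, 4)}
for t in (1, 2, 4):
    tail, bound = markov_check(pmf, t)
    print(f"t={t}: Pr{{A>=t}}={tail}  E[A]/t={bound}")
```

Exact rational arithmetic is used so that the comparison between the two sides is not blurred by rounding.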

As in the standard proof, the code is a random code where each letter of each codeword is drawn i.i.d. with the distribution $P_X(x)$. The standard claim that the existence of deterministic capacity-achieving codes results from the existence of random codes is applied. After seeing the channel output vector $\mathbf{Y}$, the receiver applies maximum likelihood decoding and chooses the codeword $\mathbf{X}$ which maximizes $P_{Y|X}(\mathbf{Y}|\mathbf{X})$ (breaking ties arbitrarily). Note that the decoding metric $P_{Y|X}(\mathbf{Y}|\mathbf{X})$ does not depend on the specific code chosen.

Now, fix the transmitted and the received words $\mathbf{X}, \mathbf{Y}$ (respectively) and ask what is the pairwise error probability over the ensemble where the other codeword $\mathbf{X}_m$, $m = 1, \ldots, 2^{nR} - 1$, is independent of $\mathbf{X}, \mathbf{Y}$ and distributed $P_X(\cdot)$. Denote by $E_m$ the event that the codeword $\mathbf{X}_m$ attains an a-posteriori probability at least as high as that of $\mathbf{X}$, i.e. that $P_{Y|X}(\mathbf{Y}|\mathbf{X}_m) \ge P_{Y|X}(\mathbf{Y}|\mathbf{X})$. Then

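$$\begin{aligned}
\Pr\{E_m \mid \mathbf{X}, \mathbf{Y}\} &= \Pr\{P_{Y|X}(\mathbf{Y}|\mathbf{X}_m) \ge P_{Y|X}(\mathbf{Y}|\mathbf{X}) \mid \mathbf{X}, \mathbf{Y}\} \\
&\overset{\text{Markov}}{\le} \frac{E\left[P_{Y|X}(\mathbf{Y}|\mathbf{X}_m) \mid \mathbf{X}, \mathbf{Y}\right]}{P_{Y|X}(\mathbf{Y}|\mathbf{X})} \\
&= \frac{\sum_{\mathbf{x}_m} P_{Y|X}(\mathbf{Y}|\mathbf{x}_m) P_X(\mathbf{x}_m)}{P_{Y|X}(\mathbf{Y}|\mathbf{X})} \\
&= \frac{P_Y(\mathbf{Y})}{P_{Y|X}(\mathbf{Y}|\mathbf{X})}.
\end{aligned} \tag{2}$$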



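By the union bound, the probability of error conditioned on $\mathbf{X}, \mathbf{Y}$ is bounded as:

$$\begin{aligned}
P_{e|x,y} &\le \Pr\left\{\bigcup_m E_m \,\Big|\, \mathbf{X}, \mathbf{Y}\right\} \\
&\le 2^{nR} \cdot \Pr\{E_m \mid \mathbf{X}, \mathbf{Y}\} \\
&\le 2^{nR} \frac{P_Y(\mathbf{Y})}{P_{Y|X}(\mathbf{Y}|\mathbf{X})}.
\end{aligned} \tag{3}$$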
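The bounds (2) and (3) can be checked by hand on a toy case. The sketch below assumes a binary symmetric channel (BSC) with crossover probability p < 1/2 and uniform inputs, which the text does not single out; in that setting the exact pairwise error probability has a closed form that can be compared with the Markov bound. The function names and the numerical values are illustrative assumptions, and the cap at 1 in `union_bound` is a trivial addition that is not part of (3).

```python
from math import comb

def pairwise_error_exact(n, k):
    """Exact Pr{E_m | X, Y} for a BSC with crossover p < 1/2 and uniform inputs.

    k is the Hamming distance between the transmitted and received words.
    A competing word X_m, uniform on {0,1}^n, is at least as likely as X
    exactly when its distance to Y is at most k.
    """
    return sum(comb(n, d) for d in range(k + 1)) / 2 ** n

def markov_bound(n, k, p):
    """Right-hand side of (2): P_Y(Y) / P_{Y|X}(Y|X), with P_Y(Y) = 2^-n."""
    return 2.0 ** (-n) / (p ** k * (1 - p) ** (n - k))

def union_bound(n, k, p, rate):
    """Right-hand side of (3), capped at 1 since it bounds a probability."""
    return min(1.0, 2 ** (n * rate) * markov_bound(n, k, p))

n, p, rate = 8, 0.1, 0.25  # illustrative values only
for k in range(4):
    print(k, pairwise_error_exact(n, k), markov_bound(n, k, p),
          union_bound(n, k, p, rate))
```

For every k the exact value stays below the Markov bound, and the bound becomes trivial (larger than 1) once the received word is far from the transmitted one.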

Next, the behavior of this conditional error probability $P_{e|x,y}$ is analyzed for the memoryless channel. By the law of large numbers:

$$\frac{1}{n} \log \frac{P_Y(\mathbf{Y})}{P_{Y|X}(\mathbf{Y}|\mathbf{X})} = \frac{1}{n} \sum_{i=1}^{n} \log \frac{P_Y(Y_i)}{P_{Y|X}(Y_i|X_i)} \xrightarrow[n \to \infty]{\text{in Prob.}} E\left[\log \frac{P_Y(Y)}{P_{Y|X}(Y|X)}\right] \triangleq -I(X;Y), \tag{4}$$



where $X, Y$ are two random variables distributed according to $P_X(X) \cdot P_{Y|X}(Y|X)$. If mutual information has not been defined, then the last equality may be considered its definition. From the L.L.N. it holds that for any $\epsilon, \delta > 0$ there is $n$ large enough such that with probability at least $1 - \epsilon$ (the probability is over $\mathbf{X}, \mathbf{Y}$):

$$\frac{1}{n} \log \frac{P_Y(\mathbf{Y})}{P_{Y|X}(\mathbf{Y}|\mathbf{X})} \le -I(X;Y) + \delta \tag{5}$$



When (5) holds, then by (3) the conditional probability of error is bounded by $2^{-n(I(X;Y) - \delta - R)}$, and thus the overall error probability is bounded by the union bound

$$P_e \le \epsilon + 2^{-n(I(X;Y) - \delta - R)} \tag{6}$$



which can be made arbitrarily small if $R < I(X;Y)$, since $\epsilon, \delta$ can be arbitrarily small. This proves $I(X;Y)$ is an achievable rate (by standard definitions, e.g. [__, §7.5]), and the capacity is attained by optimizing over $P_X(\cdot)$. $\square$

The same proof applies to continuous channels, i.e. $P_{XY}(\cdot)$ may denote a probability density of continuous variables rather than a probability mass function, and the expression for mutual information directly translates into the continuous expression (difference between differential entropies).
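To see how the bound (6) behaves as n grows, the following sketch evaluates it numerically for a BSC with uniform inputs; this channel is an assumption of the example. Instead of invoking the law of large numbers, the probability that (5) fails is computed from the binomial distribution of the number of channel flips and used as ε. All function names and parameter values are hypothetical.

```python
from math import exp, lgamma, log, log2

def bsc_mutual_information(p):
    """I(X;Y) in bits for a BSC with uniform input, as E[log2(P_{Y|X}/P_Y)]."""
    # Per letter, P_Y(y) = 1/2 and P_{Y|X}(y|x) is 1-p (no flip) or p (flip).
    return (1 - p) * log2(2 * (1 - p)) + p * log2(2 * p)

def binomial_pmf(n, k, p):
    """Binomial(n, p) pmf at k, computed in the log domain to avoid overflow."""
    log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log(1 - p))
    return exp(log_pmf)

def prob_5_fails(n, p, delta):
    """Probability (over X, Y) that inequality (5) is violated."""
    mi = bsc_mutual_information(p)
    total = 0.0
    for k in range(n + 1):  # k = number of channel flips
        # (1/n) log2(P_Y(Y) / P_{Y|X}(Y|X)) when k of the n letters are flipped
        lhs = -(1 + (k / n) * log2(p) + (1 - k / n) * log2(1 - p))
        if lhs > -mi + delta:
            total += binomial_pmf(n, k, p)
    return total

def error_bound_6(n, p, delta, rate):
    """Right-hand side of (6), with epsilon set to the failure probability of (5)."""
    mi = bsc_mutual_information(p)
    return prob_5_fails(n, p, delta) + 2.0 ** (-n * (mi - delta - rate))

p, delta, rate = 0.11, 0.1, 0.3  # illustrative values; I(X;Y) is about 0.5 bit
for n in (100, 400, 1600):
    print(n, error_bound_6(n, p, delta, rate))
```

With these values the rate satisfies R < I(X;Y) - δ, so both terms of the bound shrink as n increases.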

III. The Normal Approximation and Channel Dispersion

It is simple and instructive to continue the argument above and develop the well-known Normal approximation for the gap from capacity required for a certain error probability. This result is attributed to Strassen and was tightened by Polyanskiy et al. [__, Thm. 45]. The technique is not new, and the point is to see how it evolves naturally from the previous steps. In the following, mathematical details are waved aside.

Recognizing that the term $\frac{P_Y(\mathbf{Y})}{P_{Y|X}(\mathbf{Y}|\mathbf{X})}$ is $2^{-i}$, where $i$ is the information density (a function of $\mathbf{X}, \mathbf{Y}$), rewrite (3) as $P_{e|xy} \le 2^{nR - i}$. Replacing the weak LLN argument used in (4) for this term by the stronger central limit theorem implies that $\frac{i}{n}$ converges in distribution to the Gaussian distribution $\mathcal{N}\left(I, \frac{V}{n}\right)$, where $I$ is the mutual information (the normalized mean of $i$) and $V$ is the dispersion, i.e. $V = \mathrm{Var}\left(\log \frac{P_{Y|X}(Y|X)}{P_Y(Y)}\right)$.

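Ignoring the overhead terms related to this convergence and assuming that indeed $\frac{i}{n} \sim \mathcal{N}\left(I, \frac{V}{n}\right)$, then:

$$\begin{aligned}
P_e &= E\left[P_{e|xy}\right] \\
&\le 1 \cdot \Pr\{P_{e|xy} > 2^{-n\delta}\} + 2^{-n\delta} \cdot \Pr\{P_{e|xy} \le 2^{-n\delta}\} \\
&\le \Pr\{2^{nR - i} > 2^{-n\delta}\} + 2^{-n\delta} \\
&= \Pr\left\{\frac{i}{n} < R + \delta\right\} + 2^{-n\delta},
\end{aligned} \tag{7}$$

where $\delta$ is a parameter of choice. Requiring that the RHS equal a desired error probability $\epsilon$, the following rate $R$ is extracted from (7):

$$R = I - \delta - \sqrt{\frac{V}{n}}\, Q^{-1}\left(\epsilon - 2^{-n\delta}\right), \tag{8}$$

where $Q^{-1}(\cdot)$ is the inverse of the Gaussian tail function $Q(\cdot)$.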

By letting $\delta$ decrease slower than $\frac{1}{n}$ but faster than $\frac{1}{\sqrt{n}}$, the term $2^{-n\delta}$ can be made negligible compared to $\epsilon$ while $\delta$ becomes negligible compared to $\sqrt{V/n}\, Q^{-1}(\epsilon)$, and since $Q(\cdot)$ is continuous the following well-known approximation is obtained:

$$R \approx I - \sqrt{\frac{V}{n}}\, Q^{-1}(\epsilon). \tag{9}$$
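A short sketch of how the approximation (9) can be evaluated numerically is given next. It assumes a BSC with uniform inputs (an assumption of this example, not of the text), computes I and V directly as the mean and variance of the per-letter information density in bits, and obtains the inverse Gaussian tail function from Python's `statistics.NormalDist`. The function names are hypothetical.

```python
from math import log2, sqrt
from statistics import NormalDist

def bsc_information_moments(p):
    """Mean I and variance V of the per-letter information density (in bits)."""
    # log2(P_{Y|X}/P_Y) equals 1 + log2(1-p) without a flip and 1 + log2(p) with one.
    outcomes = [(1 + log2(1 - p), 1 - p), (1 + log2(p), p)]
    mean = sum(value * prob for value, prob in outcomes)
    var = sum((value - mean) ** 2 * prob for value, prob in outcomes)
    return mean, var

def q_inverse(eps):
    """Inverse Gaussian tail function: Q^{-1}(eps) = Phi^{-1}(1 - eps)."""
    return NormalDist().inv_cdf(1 - eps)

def normal_approx_rate(p, n, eps):
    """Rate given by the approximation (9): I - sqrt(V/n) * Q^{-1}(eps)."""
    mi, v = bsc_information_moments(p)
    return mi - sqrt(v / n) * q_inverse(eps)

p, eps = 0.11, 1e-3  # illustrative values only
for n in (250, 1000, 4000):
    print(n, normal_approx_rate(p, n, eps))
```

The printed rates approach I from below as n grows, with a gap that shrinks proportionally to the inverse square root of n.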



IV. Discussion 

The techniques used in the proofs above are all well known. The only new technique is the use of the Markov bound in (2). As can be seen in (7), the result of (3) is equivalent to a theorem by Shannon [5, Thm. 1] [__, Thm. 2]. Shannon showed this bound is tight in terms of rate [5, Thm. 2]. Many results in coding can be obtained from this bound, or from its stronger versions such as Feinstein's Lemma [8, Thm. 1] and results by Polyanskiy et al. [__, Lemma 19]. Therefore the main technical contribution of this paper is in supplying a short proof of Shannon's theorem [5, Thm. 1] by (2)-(3).

It may seem surprising that tight bounds can be obtained by the Markov inequality. First note that for analysis of error probability near the channel capacity, the tightness of the bound on $P_{e|xy}$ is not critical. The error probability is typically bounded (as in (7)) by two components: the probability of the normalized information density $i/n$ to fall below the rate $R$ (the probability of a "bad empirical channel"), and the remaining error probability when $i/n$ is above the rate (the error probability in a "good empirical channel"). Near the capacity, the first probability is dominant (as evident from the previous section), and therefore, roughly speaking, any bound on the error that vanishes when $i/n > R + \delta$ is satisfactory.

To compare with the methods used in Shannon's proof [5, Thm. 1], first use Bayes' rule to reformulate the pairwise error condition $P_{Y|X}(\mathbf{Y}|\mathbf{X}_m) \ge P_{Y|X}(\mathbf{Y}|\mathbf{X})$ as $\frac{P_{X|Y}(\mathbf{X}_m|\mathbf{Y})}{P_X(\mathbf{X}_m)} \ge 2^{i} = \frac{P_{X|Y}(\mathbf{X}|\mathbf{Y})}{P_X(\mathbf{X})}$. The LHS is the ratio between two probability distributions on $\mathbf{X}$. Given two distributions $P, Q$, Shannon's argument is based on the fact that the probability under distribution $P$ of the set $\left\{x : \frac{Q(x)}{P(x)} \ge t\right\}$ cannot be larger than $\frac{1}{t}$, which is obtained by summing both sides of $P(x) \le \frac{1}{t} Q(x)$ over the set. Using the Markov inequality yields the same bound because $E_P\left[\frac{Q(x)}{P(x)}\right] = 1$. This explains the fact that the two bounding techniques yield the same bound in (2). The Markov inequality yields a simpler derivation, and the information density, and subsequently the mutual information, follow naturally from the bound.
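The equivalence of the two arguments can be illustrated on a toy pair of distributions. The sketch below, with hypothetical function names and arbitrarily chosen distributions P and Q on a three-letter alphabet, computes the P-probability of the set where Q(x)/P(x) is at least t and compares it with 1/t; it also confirms that the expected ratio under P equals 1.

```python
from fractions import Fraction

def set_probability_under_p(P, Q, t):
    """P-probability of the set {x : Q(x)/P(x) >= t}."""
    return sum(P[x] for x in P if Q[x] >= t * P[x])

def expected_ratio_under_p(P, Q):
    """E_P[Q(x)/P(x)], which equals sum_x Q(x) = 1 when P(x) > 0 for all x."""
    return sum(P[x] * (Q[x] / P[x]) for x in P)

# Two toy distributions on {a, b, c}; the values are illustrative only.
P = {"a": Fraction(1, 2), "b": Fraction(3, 8), "c": Fraction(1, 8)}
Q = {"a": Fraction(1, 8), "b": Fraction(1, 4), "c": Fraction(5, 8)}
for t in (1, 2, 5):
    print(t, set_probability_under_p(P, Q, t), Fraction(1, t))
```

Since the expected ratio is 1, applying (1) to the ratio gives the bound 1/t, which is the same bound obtained by summing over the set.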

Note that in using the L.L.N. in (4) the relevant r.v. is required to have a bounded variance. This assumption also appears in the AEP-based proof.